Problem statement¶
Problem Statement - Part I This assignment contains two parts. Part-I is a programming assignment (to be submitted in a Jupyter Notebook), and Part-II includes subjective questions (to be submitted in a PDF file).
Part-II is given on the next page.
Assignment Part-I A US-based housing company named Surprise Housing has decided to enter the Australian market. The company uses data analytics to purchase houses at a price below their actual values and flip them on at a higher price. For the same purpose, the company has collected a data set from the sale of houses in Australia. The data is provided in the CSV file below.
The company is looking at prospective properties to buy to enter the market. You are required to build a regression model using regularisation in order to predict the actual value of the prospective properties and decide whether to invest in them or not.
The company wants to know:
Which variables are significant in predicting the price of a house, and
How well those variables describe the price of a house.
Also, determine the optimal value of lambda for ridge and lasso regression.
Business Goal
You are required to model the price of houses with the available independent variables. This model will then be used by the management to understand how exactly the prices vary with the variables. They can accordingly manipulate the strategy of the firm and concentrate on areas that will yield high returns. Further, the model will be a good way for management to understand the pricing dynamics of a new market.
######## Importing necessary libraries and reading data </h2>
import pandas as pd #Data Processing
import numpy as np #Linear Algebra
import seaborn as sns #Data Visualization
import matplotlib.pyplot as plt #Data Visualization
import warnings #Warnings
warnings.filterwarnings ("ignore") #Warnings
pd.set_option('display.max_rows', None)# to display all the rows
#pd.options.display.float_format = '{:.2f}'.format
df_house = pd.read_csv("train.csv")
df_house.head()
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
df_house.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1460 entries, 0 to 1459 Data columns (total 81 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 1460 non-null int64 1 MSSubClass 1460 non-null int64 2 MSZoning 1460 non-null object 3 LotFrontage 1201 non-null float64 4 LotArea 1460 non-null int64 5 Street 1460 non-null object 6 Alley 91 non-null object 7 LotShape 1460 non-null object 8 LandContour 1460 non-null object 9 Utilities 1460 non-null object 10 LotConfig 1460 non-null object 11 LandSlope 1460 non-null object 12 Neighborhood 1460 non-null object 13 Condition1 1460 non-null object 14 Condition2 1460 non-null object 15 BldgType 1460 non-null object 16 HouseStyle 1460 non-null object 17 OverallQual 1460 non-null int64 18 OverallCond 1460 non-null int64 19 YearBuilt 1460 non-null int64 20 YearRemodAdd 1460 non-null int64 21 RoofStyle 1460 non-null object 22 RoofMatl 1460 non-null object 23 Exterior1st 1460 non-null object 24 Exterior2nd 1460 non-null object 25 MasVnrType 588 non-null object 26 MasVnrArea 1452 non-null float64 27 ExterQual 1460 non-null object 28 ExterCond 1460 non-null object 29 Foundation 1460 non-null object 30 BsmtQual 1423 non-null object 31 BsmtCond 1423 non-null object 32 BsmtExposure 1422 non-null object 33 BsmtFinType1 1423 non-null object 34 BsmtFinSF1 1460 non-null int64 35 BsmtFinType2 1422 non-null object 36 BsmtFinSF2 1460 non-null int64 37 BsmtUnfSF 1460 non-null int64 38 TotalBsmtSF 1460 non-null int64 39 Heating 1460 non-null object 40 HeatingQC 1460 non-null object 41 CentralAir 1460 non-null object 42 Electrical 1459 non-null object 43 1stFlrSF 1460 non-null int64 44 2ndFlrSF 1460 non-null int64 45 LowQualFinSF 1460 non-null int64 46 GrLivArea 1460 non-null int64 47 BsmtFullBath 1460 non-null int64 48 BsmtHalfBath 1460 non-null int64 49 FullBath 1460 non-null int64 50 HalfBath 1460 non-null int64 51 BedroomAbvGr 1460 non-null int64 52 KitchenAbvGr 1460 non-null int64 53 KitchenQual 1460 non-null object 54 TotRmsAbvGrd 1460 non-null int64 55 Functional 1460 non-null object 56 Fireplaces 1460 non-null int64 57 FireplaceQu 770 non-null object 58 GarageType 1379 non-null object 59 GarageYrBlt 1379 non-null float64 60 GarageFinish 1379 non-null object 61 GarageCars 1460 non-null int64 62 GarageArea 1460 non-null int64 63 GarageQual 1379 non-null object 64 GarageCond 1379 non-null object 65 PavedDrive 1460 non-null object 66 WoodDeckSF 1460 non-null int64 67 OpenPorchSF 1460 non-null int64 68 EnclosedPorch 1460 non-null int64 69 3SsnPorch 1460 non-null int64 70 ScreenPorch 1460 non-null int64 71 PoolArea 1460 non-null int64 72 PoolQC 7 non-null object 73 Fence 281 non-null object 74 MiscFeature 54 non-null object 75 MiscVal 1460 non-null int64 76 MoSold 1460 non-null int64 77 YrSold 1460 non-null int64 78 SaleType 1460 non-null object 79 SaleCondition 1460 non-null object 80 SalePrice 1460 non-null int64 dtypes: float64(3), int64(35), object(43) memory usage: 924.0+ KB
df_house.shape
(1460, 81)
EDA¶
print("duplicate rows : ",df_house.duplicated().sum())
print("columns in which all values are null",df_house.isnull().all(axis=1).sum())
print("rows in which all values are null",df_house.isnull().all(axis=0).sum())
duplicate rows : 0 columns in which all values are null 0 rows in which all values are null 0
100*df_house.isnull().mean().sort_values(ascending=False)#rows in which most of the values are null
PoolQC 99.520548 MiscFeature 96.301370 Alley 93.767123 Fence 80.753425 MasVnrType 59.726027 FireplaceQu 47.260274 LotFrontage 17.739726 GarageYrBlt 5.547945 GarageCond 5.547945 GarageType 5.547945 GarageFinish 5.547945 GarageQual 5.547945 BsmtFinType2 2.602740 BsmtExposure 2.602740 BsmtQual 2.534247 BsmtCond 2.534247 BsmtFinType1 2.534247 MasVnrArea 0.547945 Electrical 0.068493 Id 0.000000 Functional 0.000000 Fireplaces 0.000000 KitchenQual 0.000000 KitchenAbvGr 0.000000 BedroomAbvGr 0.000000 HalfBath 0.000000 FullBath 0.000000 BsmtHalfBath 0.000000 TotRmsAbvGrd 0.000000 GarageCars 0.000000 GrLivArea 0.000000 GarageArea 0.000000 PavedDrive 0.000000 WoodDeckSF 0.000000 OpenPorchSF 0.000000 EnclosedPorch 0.000000 3SsnPorch 0.000000 ScreenPorch 0.000000 PoolArea 0.000000 MiscVal 0.000000 MoSold 0.000000 YrSold 0.000000 SaleType 0.000000 SaleCondition 0.000000 BsmtFullBath 0.000000 HeatingQC 0.000000 LowQualFinSF 0.000000 LandSlope 0.000000 OverallQual 0.000000 HouseStyle 0.000000 BldgType 0.000000 Condition2 0.000000 Condition1 0.000000 Neighborhood 0.000000 LotConfig 0.000000 YearBuilt 0.000000 Utilities 0.000000 LandContour 0.000000 LotShape 0.000000 Street 0.000000 LotArea 0.000000 MSZoning 0.000000 OverallCond 0.000000 YearRemodAdd 0.000000 2ndFlrSF 0.000000 BsmtFinSF2 0.000000 1stFlrSF 0.000000 CentralAir 0.000000 MSSubClass 0.000000 Heating 0.000000 TotalBsmtSF 0.000000 BsmtUnfSF 0.000000 BsmtFinSF1 0.000000 RoofStyle 0.000000 Foundation 0.000000 ExterCond 0.000000 ExterQual 0.000000 Exterior2nd 0.000000 Exterior1st 0.000000 RoofMatl 0.000000 SalePrice 0.000000 dtype: float64
Removing columns which have more than 30 % missing values as they are not beneficial, because even if we impute them, most of the values remain same and which will not help in our analysis.¶
df_house.shape
(1460, 81)
rm_cols = df_house.columns[df_house.isnull().mean() * 100 >= 30.00 ]
df_house.drop(rm_cols, axis=1, inplace=True)
df_house.shape
(1460, 75)
#Validating the null values again
100*df_house.isnull().mean().sort_values(ascending=False)
LotFrontage 17.739726 GarageYrBlt 5.547945 GarageCond 5.547945 GarageType 5.547945 GarageFinish 5.547945 GarageQual 5.547945 BsmtFinType2 2.602740 BsmtExposure 2.602740 BsmtFinType1 2.534247 BsmtCond 2.534247 BsmtQual 2.534247 MasVnrArea 0.547945 Electrical 0.068493 WoodDeckSF 0.000000 PavedDrive 0.000000 LowQualFinSF 0.000000 GrLivArea 0.000000 BsmtFullBath 0.000000 BsmtHalfBath 0.000000 FullBath 0.000000 HalfBath 0.000000 SaleCondition 0.000000 BedroomAbvGr 0.000000 SaleType 0.000000 YrSold 0.000000 MoSold 0.000000 MiscVal 0.000000 KitchenAbvGr 0.000000 KitchenQual 0.000000 TotRmsAbvGrd 0.000000 PoolArea 0.000000 Functional 0.000000 Fireplaces 0.000000 ScreenPorch 0.000000 2ndFlrSF 0.000000 3SsnPorch 0.000000 GarageCars 0.000000 GarageArea 0.000000 EnclosedPorch 0.000000 OpenPorchSF 0.000000 Id 0.000000 Heating 0.000000 1stFlrSF 0.000000 OverallCond 0.000000 MSZoning 0.000000 LotArea 0.000000 Street 0.000000 LotShape 0.000000 LandContour 0.000000 Utilities 0.000000 LotConfig 0.000000 LandSlope 0.000000 Neighborhood 0.000000 Condition1 0.000000 Condition2 0.000000 BldgType 0.000000 HouseStyle 0.000000 OverallQual 0.000000 YearBuilt 0.000000 CentralAir 0.000000 YearRemodAdd 0.000000 RoofStyle 0.000000 RoofMatl 0.000000 Exterior1st 0.000000 Exterior2nd 0.000000 ExterQual 0.000000 ExterCond 0.000000 Foundation 0.000000 BsmtFinSF1 0.000000 BsmtFinSF2 0.000000 BsmtUnfSF 0.000000 TotalBsmtSF 0.000000 MSSubClass 0.000000 HeatingQC 0.000000 SalePrice 0.000000 dtype: float64
Replace the missing values for numarical columns¶
null_cols = df_house.columns[df_house.isnull().any()]
df_house[null_cols].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1460 entries, 0 to 1459 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 LotFrontage 1201 non-null float64 1 MasVnrArea 1452 non-null float64 2 BsmtQual 1423 non-null object 3 BsmtCond 1423 non-null object 4 BsmtExposure 1422 non-null object 5 BsmtFinType1 1423 non-null object 6 BsmtFinType2 1422 non-null object 7 Electrical 1459 non-null object 8 GarageType 1379 non-null object 9 GarageYrBlt 1379 non-null float64 10 GarageFinish 1379 non-null object 11 GarageQual 1379 non-null object 12 GarageCond 1379 non-null object dtypes: float64(3), object(10) memory usage: 148.4+ KB
df_house['LotFrontage'].describe(percentiles=[.25, .5, .75, .90, .95, .99])
count 1201.000000 mean 70.049958 std 24.284752 min 21.000000 25% 59.000000 50% 69.000000 75% 80.000000 90% 96.000000 95% 107.000000 99% 141.000000 max 313.000000 Name: LotFrontage, dtype: float64
print(df_house['MasVnrArea'].describe(percentiles=[.25, .5, .75, .90, .95, .99]))
print(df_house['MasVnrArea'].median())
print(df_house['GarageYrBlt'].describe(percentiles=[.25, .5, .75, .90, .95, .99]))
print(df_house['GarageYrBlt'].median())
count 1452.000000 mean 103.685262 std 181.066207 min 0.000000 25% 0.000000 50% 0.000000 75% 166.000000 90% 335.000000 95% 456.000000 99% 791.920000 max 1600.000000 Name: MasVnrArea, dtype: float64 0.0 count 1379.000000 mean 1978.506164 std 24.689725 min 1900.000000 25% 1961.000000 50% 1980.000000 75% 2002.000000 90% 2006.000000 95% 2007.000000 99% 2009.000000 max 2010.000000 Name: GarageYrBlt, dtype: float64 1980.0
df_house[["LotFrontage", "MasVnrArea",'GarageYrBlt']].plot(kind= "box", subplots= True)
plt.show()
df_house[["LotFrontage", "MasVnrArea",'GarageYrBlt']].median()
LotFrontage 69.0 MasVnrArea 0.0 GarageYrBlt 1980.0 dtype: float64
# We will replace the null with median for LotFrontage MasVnrArea and GarageYrBlt columns
df_house["LotFrontage"].fillna(df_house["LotFrontage"].median(), inplace=True)
df_house["MasVnrArea"].fillna(df_house["MasVnrArea"].mean(), inplace=True)
df_house["GarageYrBlt"].fillna(df_house["GarageYrBlt"].median(), inplace=True)
Replaceing categorical columns¶
# Electrical system type Filling the Electrical with the mode
df_house['Electrical'] = df_house['Electrical'].fillna(df_house['Electrical'].mode()[0])
#Validating the null values again
df_house.columns[df_house.isnull().any()]
Index(['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'],
dtype='object')
#As mentioned in the Data Dictionary NA value means it is not present and thus we can replace it with none
null_with_meaning = ["BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2", "GarageType", "GarageFinish", "GarageQual", "GarageCond"]
for i in null_with_meaning:
df_house[i].fillna("none", inplace=True)
null_cols = df_house.columns[df_house.isnull().any()]
df_house[null_cols].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1460 entries, 0 to 1459 Empty DataFrame
df_house.drop('Id',axis=1,inplace=True)
#df_house = df_house.round(decimals = 2)
df_house.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1460 entries, 0 to 1459 Data columns (total 74 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 MSSubClass 1460 non-null int64 1 MSZoning 1460 non-null object 2 LotFrontage 1460 non-null float64 3 LotArea 1460 non-null int64 4 Street 1460 non-null object 5 LotShape 1460 non-null object 6 LandContour 1460 non-null object 7 Utilities 1460 non-null object 8 LotConfig 1460 non-null object 9 LandSlope 1460 non-null object 10 Neighborhood 1460 non-null object 11 Condition1 1460 non-null object 12 Condition2 1460 non-null object 13 BldgType 1460 non-null object 14 HouseStyle 1460 non-null object 15 OverallQual 1460 non-null int64 16 OverallCond 1460 non-null int64 17 YearBuilt 1460 non-null int64 18 YearRemodAdd 1460 non-null int64 19 RoofStyle 1460 non-null object 20 RoofMatl 1460 non-null object 21 Exterior1st 1460 non-null object 22 Exterior2nd 1460 non-null object 23 MasVnrArea 1460 non-null float64 24 ExterQual 1460 non-null object 25 ExterCond 1460 non-null object 26 Foundation 1460 non-null object 27 BsmtQual 1460 non-null object 28 BsmtCond 1460 non-null object 29 BsmtExposure 1460 non-null object 30 BsmtFinType1 1460 non-null object 31 BsmtFinSF1 1460 non-null int64 32 BsmtFinType2 1460 non-null object 33 BsmtFinSF2 1460 non-null int64 34 BsmtUnfSF 1460 non-null int64 35 TotalBsmtSF 1460 non-null int64 36 Heating 1460 non-null object 37 HeatingQC 1460 non-null object 38 CentralAir 1460 non-null object 39 Electrical 1460 non-null object 40 1stFlrSF 1460 non-null int64 41 2ndFlrSF 1460 non-null int64 42 LowQualFinSF 1460 non-null int64 43 GrLivArea 1460 non-null int64 44 BsmtFullBath 1460 non-null int64 45 BsmtHalfBath 1460 non-null int64 46 FullBath 1460 non-null int64 47 HalfBath 1460 non-null int64 48 BedroomAbvGr 1460 non-null int64 49 KitchenAbvGr 1460 non-null int64 50 KitchenQual 1460 non-null object 51 TotRmsAbvGrd 1460 non-null int64 52 Functional 1460 non-null object 53 Fireplaces 1460 non-null int64 54 GarageType 1460 non-null object 55 GarageYrBlt 1460 non-null float64 56 GarageFinish 1460 non-null object 57 GarageCars 1460 non-null int64 58 GarageArea 1460 non-null int64 59 GarageQual 1460 non-null object 60 GarageCond 1460 non-null object 61 PavedDrive 1460 non-null object 62 WoodDeckSF 1460 non-null int64 63 OpenPorchSF 1460 non-null int64 64 EnclosedPorch 1460 non-null int64 65 3SsnPorch 1460 non-null int64 66 ScreenPorch 1460 non-null int64 67 PoolArea 1460 non-null int64 68 MiscVal 1460 non-null int64 69 MoSold 1460 non-null int64 70 YrSold 1460 non-null int64 71 SaleType 1460 non-null object 72 SaleCondition 1460 non-null object 73 SalePrice 1460 non-null int64 dtypes: float64(3), int64(34), object(37) memory usage: 844.2+ KB
Deriving variables¶
#Overall area for all floors and basement plays an important role, hence creating total area in square foot column
df_house['Total_sqr_footage'] = (df_house['BsmtFinSF1'] + df_house['BsmtFinSF2'] + df_house['1stFlrSF'] + df_house['2ndFlrSF']) + df_house['GrLivArea']
# Creating derived column for total number of bathrooms column
df_house['Total_Bathrooms'] = (df_house['FullBath'] + (0.5 * df_house['HalfBath']) + df_house['BsmtFullBath'] + (0.5 * df_house['BsmtHalfBath']))
#Creating derived column for total porch area
df_house['Total_porch_sf'] = (df_house['OpenPorchSF'] + df_house['3SsnPorch'] + df_house['EnclosedPorch'] + df_house['ScreenPorch'] + df_house['WoodDeckSF'])
#Lets drop these extra columns :
extraCols = ['BsmtFinSF1','BsmtFinSF2','1stFlrSF','2ndFlrSF','GrLivArea','FullBath','HalfBath','BsmtFullBath','BsmtHalfBath','OpenPorchSF','3SsnPorch','EnclosedPorch','ScreenPorch','WoodDeckSF']
df_house.drop(extraCols,axis=1,inplace=True)
df_house.shape# verifying the shape of the dataset
(1460, 63)
# Creating a new Column to determine the age of the property
df_house['Total_Age']=df_house['YrSold']-df_house['YearBuilt']
df_house['Garage_age'] = df_house['YrSold'] - df_house['GarageYrBlt']
df_house['Remodel_age'] = df_house['YrSold'] - df_house['YearRemodAdd']
#Also lets drop out variables like GarageYrBlt and YearRemodAdd as we are already calculating the number of years
drop_cols = ['GarageYrBlt','YearRemodAdd','YearBuilt']
df_house.drop(labels = drop_cols, axis = 1, inplace=True) #Dropping the columns added in the list
print("The new size of the data is" , df_house.shape) #Printing the new Dataset Shape
The new size of the data is (1460, 63)
Correlation of Numerical columns¶
# Checking the corelation
numeric_columns = df_house.select_dtypes(include=[np.number])
plt.subplots(figsize = (25,20))
#Plotting heatmap of numerical features
sns.heatmap(round(numeric_columns.corr(),2), cmap='coolwarm' , annot=True, center = 0)
plt.show()
numeric_columns = df_house.select_dtypes(include=[np.number])
important_num_cols = list(numeric_columns.corr()["SalePrice"][(numeric_columns.corr()["SalePrice"]>0.50) | (numeric_columns.corr()["SalePrice"]<-0.50)].index)
important_num_cols #columns highly correlated with SalePRice
['OverallQual', 'TotalBsmtSF', 'TotRmsAbvGrd', 'GarageCars', 'GarageArea', 'SalePrice', 'Total_sqr_footage', 'Total_Bathrooms', 'Total_Age', 'Remodel_age']
plt.figure(figsize = (70, 90))
sns.pairplot(df_house, vars= important_num_cols)
plt.show()
<Figure size 7000x9000 with 0 Axes>
Outlier Analysis¶
#Lets divide the Columns based on Numerical/continous and categorical
Cat_cols = []
Num_cols = []
for i in df_house.columns :
if df_house[i].dtype == "object":
Cat_cols.append(i)
else:
Num_cols.append(i)
cat_info_df = pd.DataFrame({
'Categorical Column': Cat_cols,
'Info': [df_house[col].dtypes for col in Cat_cols],
'Num Unique': [df_house[col].nunique() for col in Cat_cols]
})
print(cat_info_df)
num_info_df = pd.DataFrame({
'Numerical Column': Num_cols,
'Info': [df_house[col].dtypes for col in Num_cols],
'Num Unique': [df_house[col].nunique() for col in Num_cols]
})
print(num_info_df)
Categorical Column Info Num Unique
0 MSZoning object 5
1 Street object 2
2 LotShape object 4
3 LandContour object 4
4 Utilities object 2
5 LotConfig object 5
6 LandSlope object 3
7 Neighborhood object 25
8 Condition1 object 9
9 Condition2 object 8
10 BldgType object 5
11 HouseStyle object 8
12 RoofStyle object 6
13 RoofMatl object 8
14 Exterior1st object 15
15 Exterior2nd object 16
16 ExterQual object 4
17 ExterCond object 5
18 Foundation object 6
19 BsmtQual object 5
20 BsmtCond object 5
21 BsmtExposure object 5
22 BsmtFinType1 object 7
23 BsmtFinType2 object 7
24 Heating object 6
25 HeatingQC object 5
26 CentralAir object 2
27 Electrical object 5
28 KitchenQual object 4
29 Functional object 7
30 GarageType object 7
31 GarageFinish object 4
32 GarageQual object 6
33 GarageCond object 6
34 PavedDrive object 3
35 SaleType object 9
36 SaleCondition object 6
Numerical Column Info Num Unique
0 MSSubClass int64 15
1 LotFrontage float64 110
2 LotArea int64 1073
3 OverallQual int64 10
4 OverallCond int64 9
5 MasVnrArea float64 328
6 BsmtUnfSF int64 780
7 TotalBsmtSF int64 721
8 LowQualFinSF int64 24
9 BedroomAbvGr int64 8
10 KitchenAbvGr int64 4
11 TotRmsAbvGrd int64 12
12 Fireplaces int64 4
13 GarageCars int64 5
14 GarageArea int64 441
15 PoolArea int64 8
16 MiscVal int64 21
17 MoSold int64 12
18 YrSold int64 5
19 SalePrice int64 663
20 Total_sqr_footage int64 1124
21 Total_Bathrooms float64 10
22 Total_porch_sf int64 427
23 Total_Age int64 122
24 Garage_age float64 101
25 Remodel_age int64 62
#Lets plot SalePrice against all the categorical columns
plt.figure(figsize=(30,80))#The size of the plot
c=0
num_cols = 3
num_rows = (len(Cat_cols) - 1) // num_cols + 1 # Calculate the number of rows needed
for i in Cat_cols:
c=c+1
plt.subplot(num_rows,num_cols,c)
sns.boxplot(x = 'SalePrice', y = df_house[str(i)], data = df_house)
plt.title(str(i)+" VS SalePrice - Box Plot\n",fontsize=20)#The title of the plot
plt.tight_layout()#to avoid overlapping layout
plt.show()#to display the plot
#As we can observe there are outlier in many columns, listing them below
outlier = ['LotFrontage','LotArea','Total_sqr_footage','Total_porch_sf']
for i in outlier:
qnt = df_house[i].quantile(0.98)#removing data above 98 percentile
df_house = df_house[df_house[i] < qnt]
df_house.shape
(1343, 63)
Data Preparation¶
Creating dummy columns for categorical columns¶
df_house = pd.get_dummies(df_house,drop_first=True)
df_house.info()#displaying the updated Datatypes
<class 'pandas.core.frame.DataFrame'> Index: 1343 entries, 0 to 1458 Columns: 225 entries, MSSubClass to SaleCondition_Partial dtypes: bool(199), float64(4), int64(22) memory usage: 544.3 KB
df_house.head()
| MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | MasVnrArea | BsmtUnfSF | TotalBsmtSF | LowQualFinSF | BedroomAbvGr | ... | SaleType_ConLI | SaleType_ConLw | SaleType_New | SaleType_Oth | SaleType_WD | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | 65.0 | 8450 | 7 | 5 | 196.0 | 150 | 856 | 0 | 3 | ... | False | False | False | False | True | False | False | False | True | False |
| 1 | 20 | 80.0 | 9600 | 6 | 8 | 0.0 | 284 | 1262 | 0 | 3 | ... | False | False | False | False | True | False | False | False | True | False |
| 2 | 60 | 68.0 | 11250 | 7 | 5 | 162.0 | 434 | 920 | 0 | 3 | ... | False | False | False | False | True | False | False | False | True | False |
| 3 | 70 | 60.0 | 9550 | 7 | 5 | 0.0 | 540 | 756 | 0 | 3 | ... | False | False | False | False | True | False | False | False | False | False |
| 4 | 60 | 84.0 | 14260 | 8 | 5 | 350.0 | 490 | 1145 | 0 | 4 | ... | False | False | False | False | True | False | False | False | True | False |
5 rows × 225 columns
bool_columns = df_house.select_dtypes(include=['bool'])
def map_bool(col):
return col.map({True:1,False:0})
df_house[bool_columns.columns] = df_house[bool_columns.columns].apply(map_bool)
df_house.info()
<class 'pandas.core.frame.DataFrame'> Index: 1343 entries, 0 to 1458 Columns: 225 entries, MSSubClass to SaleCondition_Partial dtypes: float64(4), int64(221) memory usage: 2.3 MB
plt.figure(figsize=(16,6))
sns.distplot(df_house.SalePrice)
plt.show()
df_house.shape
(1343, 225)
Dividing the Data in terms of TRAIN and TEST.¶
# Using Sklearn and stats model for modeling
from sklearn.model_selection import train_test_split #for spliting the data in terms of train and test
from sklearn.preprocessing import MinMaxScaler #for performing minmax scaling on the continous variables of training data
from sklearn.feature_selection import RFE #for performing automated Feature Selection
from sklearn.linear_model import LinearRegression #to build linear model
from sklearn.linear_model import Ridge #for ridge regularization
from sklearn.linear_model import Lasso #for lasso regularization
from sklearn.model_selection import GridSearchCV #finding the optimal parameter values
from sklearn.metrics import r2_score #for calculating the r-square value
import statsmodels.api as sm #for add the constant value
from sklearn import metrics
from statsmodels.stats.outliers_influence import variance_inflation_factor #to calculate the VIF
from sklearn.metrics import mean_squared_error #for calculating the mean squared error
#Here we will keep the train size as 70% and automatically test size will be rest 30%
#Also we will keep random_state as a fixed value of 100 so that the data set does no changes
df_train,df_test = train_test_split(df_house, train_size = 0.7, random_state = 100)
print ("The Size of Train data is",df_train.shape)
print ("The Size of Test data is",df_test.shape)
The Size of Train data is (940, 225) The Size of Test data is (403, 225)
Scaling¶
Scaler = MinMaxScaler() # Instantiate an objectr
#Note-The above order of columns in num_cols should be same in df_test, otherwise we will get a wrong r-square value
df_train[Num_cols] = Scaler.fit_transform(df_train[Num_cols])
df_test[Num_cols] = Scaler.transform(df_test[Num_cols])
#Define X_train and y_train
y_train = df_train.pop('SalePrice') #This contains only the Target Variable
X_train = df_train #This contains all the Independent Variables except the Target Variable
#Since 'SalePrice' is the target variable we will keep it only on y-train and remove it from X_train
y_test = df_test.pop('SalePrice')
X_test = df_test
Model :1 Automated Process using RFE and VIF¶
X_train.shape
(940, 224)
#Fit the Model
lr = LinearRegression()
rfe = RFE(lr, n_features_to_select=100)
rfe = rfe.fit(X_train,y_train)
#View the support_ and rank_
list(zip(X_train.columns,rfe.support_,rfe.ranking_))
[('MSSubClass', False, 20),
('LotFrontage', False, 97),
('LotArea', True, 1),
('OverallQual', True, 1),
('OverallCond', True, 1),
('MasVnrArea', False, 48),
('BsmtUnfSF', True, 1),
('TotalBsmtSF', False, 8),
('LowQualFinSF', False, 55),
('BedroomAbvGr', True, 1),
('KitchenAbvGr', True, 1),
('TotRmsAbvGrd', True, 1),
('Fireplaces', False, 32),
('GarageCars', False, 119),
('GarageArea', True, 1),
('PoolArea', True, 1),
('MiscVal', True, 1),
('MoSold', False, 106),
('YrSold', False, 100),
('Total_sqr_footage', True, 1),
('Total_Bathrooms', True, 1),
('Total_porch_sf', False, 25),
('Total_Age', True, 1),
('Garage_age', False, 58),
('Remodel_age', False, 61),
('MSZoning_FV', True, 1),
('MSZoning_RH', False, 4),
('MSZoning_RL', True, 1),
('MSZoning_RM', False, 6),
('Street_Pave', True, 1),
('LotShape_IR2', False, 120),
('LotShape_IR3', False, 15),
('LotShape_Reg', False, 77),
('LandContour_HLS', False, 84),
('LandContour_Low', False, 18),
('LandContour_Lvl', False, 46),
('Utilities_NoSeWa', True, 1),
('LotConfig_CulDSac', False, 75),
('LotConfig_FR2', False, 35),
('LotConfig_FR3', False, 112),
('LotConfig_Inside', False, 114),
('LandSlope_Mod', False, 111),
('LandSlope_Sev', True, 1),
('Neighborhood_Blueste', False, 66),
('Neighborhood_BrDale', False, 90),
('Neighborhood_BrkSide', False, 122),
('Neighborhood_ClearCr', False, 52),
('Neighborhood_CollgCr', False, 22),
('Neighborhood_Crawfor', True, 1),
('Neighborhood_Edwards', False, 9),
('Neighborhood_Gilbert', False, 12),
('Neighborhood_IDOTRR', False, 73),
('Neighborhood_MeadowV', True, 1),
('Neighborhood_Mitchel', True, 1),
('Neighborhood_NAmes', False, 10),
('Neighborhood_NPkVill', False, 38),
('Neighborhood_NWAmes', False, 7),
('Neighborhood_NoRidge', True, 1),
('Neighborhood_NridgHt', True, 1),
('Neighborhood_OldTown', False, 2),
('Neighborhood_SWISU', False, 24),
('Neighborhood_Sawyer', False, 21),
('Neighborhood_SawyerW', False, 23),
('Neighborhood_Somerst', True, 1),
('Neighborhood_StoneBr', True, 1),
('Neighborhood_Timber', False, 13),
('Neighborhood_Veenker', False, 91),
('Condition1_Feedr', True, 1),
('Condition1_Norm', True, 1),
('Condition1_PosA', False, 36),
('Condition1_PosN', True, 1),
('Condition1_RRAe', True, 1),
('Condition1_RRAn', True, 1),
('Condition1_RRNe', True, 1),
('Condition1_RRNn', False, 76),
('Condition2_Feedr', False, 47),
('Condition2_Norm', False, 56),
('Condition2_PosN', True, 1),
('Condition2_RRAe', True, 1),
('Condition2_RRAn', False, 27),
('Condition2_RRNn', False, 62),
('BldgType_2fmCon', False, 83),
('BldgType_Duplex', True, 1),
('BldgType_Twnhs', True, 1),
('BldgType_TwnhsE', True, 1),
('HouseStyle_1.5Unf', False, 57),
('HouseStyle_1Story', False, 89),
('HouseStyle_2.5Fin', True, 1),
('HouseStyle_2.5Unf', False, 124),
('HouseStyle_2Story', False, 64),
('HouseStyle_SFoyer', False, 121),
('HouseStyle_SLvl', False, 117),
('RoofStyle_Gable', False, 50),
('RoofStyle_Gambrel', False, 51),
('RoofStyle_Hip', False, 49),
('RoofStyle_Mansard', True, 1),
('RoofStyle_Shed', True, 1),
('RoofMatl_Metal', True, 1),
('RoofMatl_Roll', False, 70),
('RoofMatl_Tar&Grv', True, 1),
('RoofMatl_WdShake', False, 85),
('RoofMatl_WdShngl', True, 1),
('Exterior1st_AsphShn', True, 1),
('Exterior1st_BrkComm', False, 5),
('Exterior1st_BrkFace', False, 74),
('Exterior1st_CBlock', True, 1),
('Exterior1st_CemntBd', True, 1),
('Exterior1st_HdBoard', False, 31),
('Exterior1st_ImStucc', False, 14),
('Exterior1st_MetalSd', False, 72),
('Exterior1st_Plywood', False, 28),
('Exterior1st_Stone', True, 1),
('Exterior1st_Stucco', False, 42),
('Exterior1st_VinylSd', False, 71),
('Exterior1st_Wd Sdng', True, 1),
('Exterior1st_WdShing', False, 26),
('Exterior2nd_AsphShn', True, 1),
('Exterior2nd_Brk Cmn', False, 53),
('Exterior2nd_BrkFace', True, 1),
('Exterior2nd_CBlock', True, 1),
('Exterior2nd_CmentBd', True, 1),
('Exterior2nd_HdBoard', False, 109),
('Exterior2nd_ImStucc', False, 67),
('Exterior2nd_MetalSd', False, 110),
('Exterior2nd_Other', True, 1),
('Exterior2nd_Plywood', False, 86),
('Exterior2nd_Stone', False, 40),
('Exterior2nd_Stucco', False, 33),
('Exterior2nd_VinylSd', False, 108),
('Exterior2nd_Wd Sdng', True, 1),
('Exterior2nd_Wd Shng', False, 113),
('ExterQual_Fa', False, 88),
('ExterQual_Gd', True, 1),
('ExterQual_TA', True, 1),
('ExterCond_Fa', False, 17),
('ExterCond_Gd', False, 16),
('ExterCond_Po', True, 1),
('ExterCond_TA', False, 19),
('Foundation_CBlock', False, 79),
('Foundation_PConc', False, 78),
('Foundation_Slab', False, 87),
('Foundation_Stone', False, 65),
('Foundation_Wood', True, 1),
('BsmtQual_Fa', True, 1),
('BsmtQual_Gd', True, 1),
('BsmtQual_TA', True, 1),
('BsmtQual_none', True, 1),
('BsmtCond_Gd', False, 60),
('BsmtCond_Po', True, 1),
('BsmtCond_TA', False, 59),
('BsmtCond_none', True, 1),
('BsmtExposure_Gd', True, 1),
('BsmtExposure_Mn', False, 96),
('BsmtExposure_No', False, 95),
('BsmtExposure_none', True, 1),
('BsmtFinType1_BLQ', False, 103),
('BsmtFinType1_GLQ', False, 39),
('BsmtFinType1_LwQ', False, 81),
('BsmtFinType1_Rec', False, 82),
('BsmtFinType1_Unf', False, 118),
('BsmtFinType1_none', True, 1),
('BsmtFinType2_BLQ', False, 68),
('BsmtFinType2_GLQ', False, 80),
('BsmtFinType2_LwQ', False, 93),
('BsmtFinType2_Rec', False, 92),
('BsmtFinType2_Unf', False, 69),
('BsmtFinType2_none', True, 1),
('Heating_GasA', True, 1),
('Heating_GasW', True, 1),
('Heating_Grav', True, 1),
('Heating_OthW', True, 1),
('Heating_Wall', True, 1),
('HeatingQC_Fa', False, 37),
('HeatingQC_Gd', False, 107),
('HeatingQC_Po', True, 1),
('HeatingQC_TA', False, 102),
('CentralAir_Y', False, 115),
('Electrical_FuseF', False, 99),
('Electrical_FuseP', False, 63),
('Electrical_Mix', True, 1),
('Electrical_SBrkr', False, 98),
('KitchenQual_Fa', True, 1),
('KitchenQual_Gd', True, 1),
('KitchenQual_TA', True, 1),
('Functional_Maj2', False, 29),
('Functional_Min1', False, 54),
('Functional_Min2', False, 94),
('Functional_Mod', True, 1),
('Functional_Sev', True, 1),
('Functional_Typ', True, 1),
('GarageType_Attchd', True, 1),
('GarageType_Basment', True, 1),
('GarageType_BuiltIn', True, 1),
('GarageType_CarPort', False, 3),
('GarageType_Detchd', True, 1),
('GarageType_none', True, 1),
('GarageFinish_RFn', False, 101),
('GarageFinish_Unf', False, 116),
('GarageFinish_none', True, 1),
('GarageQual_Fa', True, 1),
('GarageQual_Gd', True, 1),
('GarageQual_Po', True, 1),
('GarageQual_TA', True, 1),
('GarageQual_none', True, 1),
('GarageCond_Fa', True, 1),
('GarageCond_Gd', True, 1),
('GarageCond_Po', True, 1),
('GarageCond_TA', True, 1),
('GarageCond_none', True, 1),
('PavedDrive_P', False, 34),
('PavedDrive_Y', False, 104),
('SaleType_CWD', True, 1),
('SaleType_Con', True, 1),
('SaleType_ConLD', False, 30),
('SaleType_ConLI', False, 11),
('SaleType_ConLw', False, 123),
('SaleType_New', True, 1),
('SaleType_Oth', False, 43),
('SaleType_WD', False, 125),
('SaleCondition_AdjLand', False, 45),
('SaleCondition_Alloca', False, 105),
('SaleCondition_Family', True, 1),
('SaleCondition_Normal', False, 44),
('SaleCondition_Partial', False, 41)]
#List of columns selected by RFE
Rfe_Cols = X_train.columns[rfe.support_]
Rfe_Cols
Index(['LotArea', 'OverallQual', 'OverallCond', 'BsmtUnfSF', 'BedroomAbvGr',
'KitchenAbvGr', 'TotRmsAbvGrd', 'GarageArea', 'PoolArea', 'MiscVal',
'Total_sqr_footage', 'Total_Bathrooms', 'Total_Age', 'MSZoning_FV',
'MSZoning_RL', 'Street_Pave', 'Utilities_NoSeWa', 'LandSlope_Sev',
'Neighborhood_Crawfor', 'Neighborhood_MeadowV', 'Neighborhood_Mitchel',
'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_Somerst',
'Neighborhood_StoneBr', 'Condition1_Feedr', 'Condition1_Norm',
'Condition1_PosN', 'Condition1_RRAe', 'Condition1_RRAn',
'Condition1_RRNe', 'Condition2_PosN', 'Condition2_RRAe',
'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE',
'HouseStyle_2.5Fin', 'RoofStyle_Mansard', 'RoofStyle_Shed',
'RoofMatl_Metal', 'RoofMatl_Tar&Grv', 'RoofMatl_WdShngl',
'Exterior1st_AsphShn', 'Exterior1st_CBlock', 'Exterior1st_CemntBd',
'Exterior1st_Stone', 'Exterior1st_Wd Sdng', 'Exterior2nd_AsphShn',
'Exterior2nd_BrkFace', 'Exterior2nd_CBlock', 'Exterior2nd_CmentBd',
'Exterior2nd_Other', 'Exterior2nd_Wd Sdng', 'ExterQual_Gd',
'ExterQual_TA', 'ExterCond_Po', 'Foundation_Wood', 'BsmtQual_Fa',
'BsmtQual_Gd', 'BsmtQual_TA', 'BsmtQual_none', 'BsmtCond_Po',
'BsmtCond_none', 'BsmtExposure_Gd', 'BsmtExposure_none',
'BsmtFinType1_none', 'BsmtFinType2_none', 'Heating_GasA',
'Heating_GasW', 'Heating_Grav', 'Heating_OthW', 'Heating_Wall',
'HeatingQC_Po', 'Electrical_Mix', 'KitchenQual_Fa', 'KitchenQual_Gd',
'KitchenQual_TA', 'Functional_Mod', 'Functional_Sev', 'Functional_Typ',
'GarageType_Attchd', 'GarageType_Basment', 'GarageType_BuiltIn',
'GarageType_Detchd', 'GarageType_none', 'GarageFinish_none',
'GarageQual_Fa', 'GarageQual_Gd', 'GarageQual_Po', 'GarageQual_TA',
'GarageQual_none', 'GarageCond_Fa', 'GarageCond_Gd', 'GarageCond_Po',
'GarageCond_TA', 'GarageCond_none', 'SaleType_CWD', 'SaleType_Con',
'SaleType_New', 'SaleCondition_Family'],
dtype='object')
#Creating X_train using RFE selected variables
#We are using the function of statsmodels here
X_train_rfe = X_train[Rfe_Cols] #X_train_rfe will now have all the RFE selected features
X_train_rfe = sm.add_constant(X_train_rfe) # adding the constant c to the variables to form the equation y = mx + c
X_train_rfe.shape
(940, 101)
#Running the Model
lm = sm.OLS(y_train,X_train_rfe).fit()
#Stats summary of the model
print(lm.summary())
OLS Regression Results
==============================================================================
Dep. Variable: SalePrice R-squared: 0.935
Model: OLS Adj. R-squared: 0.929
Method: Least Squares F-statistic: 143.1
Date: Sun, 21 Jan 2024 Prob (F-statistic): 0.00
Time: 19:35:07 Log-Likelihood: 1914.1
No. Observations: 940 AIC: -3654.
Df Residuals: 853 BIC: -3233.
Df Model: 86
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const -0.0700 0.042 -1.676 0.094 -0.152 0.012
LotArea 0.0391 0.010 3.866 0.000 0.019 0.059
OverallQual 0.1117 0.015 7.623 0.000 0.083 0.141
OverallCond 0.0768 0.010 7.779 0.000 0.057 0.096
BsmtUnfSF 0.0734 0.008 9.253 0.000 0.058 0.089
BedroomAbvGr -0.0718 0.014 -5.169 0.000 -0.099 -0.045
KitchenAbvGr -0.0541 0.024 -2.214 0.027 -0.102 -0.006
TotRmsAbvGrd 0.0277 0.015 1.851 0.065 -0.002 0.057
GarageArea 0.0595 0.012 5.065 0.000 0.036 0.083
PoolArea 0.0427 0.025 1.698 0.090 -0.007 0.092
MiscVal 0.1449 0.055 2.634 0.009 0.037 0.253
Total_sqr_footage 0.2904 0.015 19.462 0.000 0.261 0.320
Total_Bathrooms 0.0436 0.013 3.478 0.001 0.019 0.068
Total_Age -0.0884 0.012 -7.377 0.000 -0.112 -0.065
MSZoning_FV -0.0161 0.011 -1.432 0.153 -0.038 0.006
MSZoning_RL 0.0071 0.004 1.802 0.072 -0.001 0.015
Street_Pave 0.1042 0.027 3.918 0.000 0.052 0.156
Utilities_NoSeWa -2.046e-16 5.07e-17 -4.035 0.000 -3.04e-16 -1.05e-16
LandSlope_Sev -0.0370 0.024 -1.512 0.131 -0.085 0.011
Neighborhood_Crawfor 0.0353 0.007 5.126 0.000 0.022 0.049
Neighborhood_MeadowV -0.0152 0.014 -1.068 0.286 -0.043 0.013
Neighborhood_Mitchel -0.0167 0.006 -2.650 0.008 -0.029 -0.004
Neighborhood_NoRidge 0.0476 0.008 5.742 0.000 0.031 0.064
Neighborhood_NridgHt 0.0504 0.007 7.317 0.000 0.037 0.064
Neighborhood_Somerst 0.0415 0.010 4.162 0.000 0.022 0.061
Neighborhood_StoneBr 0.0888 0.011 8.087 0.000 0.067 0.110
Condition1_Feedr 0.0214 0.008 2.815 0.005 0.006 0.036
Condition1_Norm 0.0358 0.006 5.791 0.000 0.024 0.048
Condition1_PosN 0.0473 0.013 3.696 0.000 0.022 0.072
Condition1_RRAe -0.0054 0.013 -0.412 0.680 -0.031 0.020
Condition1_RRAn 0.0227 0.010 2.238 0.026 0.003 0.043
Condition1_RRNe 0.0314 0.034 0.927 0.354 -0.035 0.098
Condition2_PosN -0.0781 0.037 -2.131 0.033 -0.150 -0.006
Condition2_RRAe -0.0988 0.033 -3.016 0.003 -0.163 -0.035
BldgType_Duplex -0.0355 0.009 -3.926 0.000 -0.053 -0.018
BldgType_Twnhs -0.0387 0.008 -4.711 0.000 -0.055 -0.023
BldgType_TwnhsE -0.0265 0.006 -4.771 0.000 -0.037 -0.016
HouseStyle_2.5Fin -0.0666 0.021 -3.143 0.002 -0.108 -0.025
RoofStyle_Mansard 0.0181 0.017 1.057 0.291 -0.016 0.052
RoofStyle_Shed -0.0988 0.033 -3.016 0.003 -0.163 -0.035
RoofMatl_Metal 0.0374 0.042 0.886 0.376 -0.046 0.120
RoofMatl_Tar&Grv 0.0088 0.026 0.332 0.740 -0.043 0.061
RoofMatl_WdShngl -0.0324 0.035 -0.935 0.350 -0.100 0.036
Exterior1st_AsphShn 8.144e-18 4.59e-17 0.178 0.859 -8.19e-17 9.82e-17
Exterior1st_CBlock -0.0087 0.018 -0.494 0.621 -0.043 0.026
Exterior1st_CemntBd -0.0330 0.025 -1.333 0.183 -0.082 0.016
Exterior1st_Stone -0.0128 0.035 -0.365 0.715 -0.081 0.056
Exterior1st_Wd Sdng -0.0178 0.007 -2.377 0.018 -0.033 -0.003
Exterior2nd_AsphShn -1.417e-17 2.05e-17 -0.691 0.490 -5.44e-17 2.61e-17
Exterior2nd_BrkFace 0.0189 0.010 1.944 0.052 -0.000 0.038
Exterior2nd_CBlock -0.0087 0.018 -0.494 0.621 -0.043 0.026
Exterior2nd_CmentBd 0.0538 0.025 2.145 0.032 0.005 0.103
Exterior2nd_Other -0.0332 0.035 -0.938 0.349 -0.103 0.036
Exterior2nd_Wd Sdng 0.0210 0.007 2.818 0.005 0.006 0.036
ExterQual_Gd -0.0133 0.008 -1.688 0.092 -0.029 0.002
ExterQual_TA -0.0217 0.008 -2.639 0.008 -0.038 -0.006
ExterCond_Po 0.0605 0.039 1.567 0.118 -0.015 0.136
Foundation_Wood -0.0519 0.034 -1.511 0.131 -0.119 0.016
BsmtQual_Fa -0.0520 0.010 -5.013 0.000 -0.072 -0.032
BsmtQual_Gd -0.0516 0.006 -8.662 0.000 -0.063 -0.040
BsmtQual_TA -0.0521 0.007 -7.321 0.000 -0.066 -0.038
BsmtQual_none -0.0014 0.009 -0.164 0.869 -0.019 0.016
BsmtCond_Po 0.0222 0.044 0.507 0.612 -0.064 0.108
BsmtCond_none -0.0014 0.009 -0.164 0.869 -0.019 0.016
BsmtExposure_Gd 0.0343 0.005 7.268 0.000 0.025 0.044
BsmtExposure_none -0.0249 0.033 -0.747 0.455 -0.090 0.041
BsmtFinType1_none -0.0014 0.009 -0.164 0.869 -0.019 0.016
BsmtFinType2_none -0.0014 0.009 -0.164 0.869 -0.019 0.016
Heating_GasA 0.0023 0.012 0.194 0.846 -0.021 0.026
Heating_GasW -0.0155 0.015 -1.026 0.305 -0.045 0.014
Heating_Grav -0.0270 0.026 -1.053 0.293 -0.077 0.023
Heating_OthW -0.0542 0.031 -1.772 0.077 -0.114 0.006
Heating_Wall 0.0243 0.021 1.143 0.253 -0.017 0.066
HeatingQC_Po -2.81e-18 1.05e-17 -0.268 0.789 -2.34e-17 1.78e-17
Electrical_Mix -0.0184 0.061 -0.300 0.764 -0.139 0.102
KitchenQual_Fa -0.0364 0.011 -3.460 0.001 -0.057 -0.016
KitchenQual_Gd -0.0366 0.007 -5.488 0.000 -0.050 -0.024
KitchenQual_TA -0.0380 0.007 -5.188 0.000 -0.052 -0.024
Functional_Mod -0.0294 0.017 -1.770 0.077 -0.062 0.003
Functional_Sev -0.1623 0.044 -3.670 0.000 -0.249 -0.076
Functional_Typ 0.0240 0.005 4.559 0.000 0.014 0.034
GarageType_Attchd 0.0105 0.012 0.886 0.376 -0.013 0.034
GarageType_Basment 0.0115 0.017 0.686 0.493 -0.021 0.044
GarageType_BuiltIn 0.0188 0.013 1.449 0.148 -0.007 0.044
GarageType_Detchd 0.0180 0.012 1.553 0.121 -0.005 0.041
GarageType_none 0.0011 0.009 0.115 0.908 -0.017 0.019
GarageFinish_none 0.0011 0.009 0.115 0.908 -0.017 0.019
GarageQual_Fa -0.0010 0.021 -0.046 0.963 -0.042 0.040
GarageQual_Gd 0.0215 0.023 0.948 0.343 -0.023 0.066
GarageQual_Po -0.0766 0.038 -1.996 0.046 -0.152 -0.001
GarageQual_TA 0.0093 0.021 0.451 0.652 -0.031 0.050
GarageQual_none 0.0011 0.009 0.115 0.908 -0.017 0.019
GarageCond_Fa -0.0255 0.021 -1.217 0.224 -0.067 0.016
GarageCond_Gd -0.0361 0.026 -1.403 0.161 -0.087 0.014
GarageCond_Po 0.0391 0.040 0.987 0.324 -0.039 0.117
GarageCond_TA -0.0244 0.020 -1.193 0.233 -0.064 0.016
GarageCond_none 0.0011 0.009 0.115 0.908 -0.017 0.019
SaleType_CWD 0.0230 0.020 1.132 0.258 -0.017 0.063
SaleType_Con 0.1121 0.035 3.181 0.002 0.043 0.181
SaleType_New 0.0301 0.005 5.967 0.000 0.020 0.040
SaleCondition_Family -0.0282 0.009 -2.982 0.003 -0.047 -0.010
==============================================================================
Omnibus: 272.441 Durbin-Watson: 2.052
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3475.595
Skew: 0.947 Prob(JB): 0.00
Kurtosis: 12.228 Cond. No. 1.06e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 8.96e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
#Listing VIF of all columns
vif = pd.DataFrame()
vif['Features'] = X_train_rfe.columns #Column Names
vif['VIF'] = [variance_inflation_factor(X_train_rfe.values, i) for i in range(X_train_rfe.shape[1])] #VIF value
vif['VIF'] = round(vif['VIF'], 2) #Rounding to 2 decimal places
vif = vif.sort_values(by = "VIF", ascending = False) #arranging in decending order
vif
| Features | VIF | |
|---|---|---|
| 95 | GarageCond_TA | inf |
| 33 | Condition2_RRAe | inf |
| 96 | GarageCond_none | inf |
| 44 | Exterior1st_CBlock | inf |
| 93 | GarageCond_Gd | inf |
| 50 | Exterior2nd_CBlock | inf |
| 94 | GarageCond_Po | inf |
| 61 | BsmtQual_none | inf |
| 63 | BsmtCond_none | inf |
| 66 | BsmtFinType1_none | inf |
| 67 | BsmtFinType2_none | inf |
| 68 | Heating_GasA | inf |
| 69 | Heating_GasW | inf |
| 70 | Heating_Grav | inf |
| 71 | Heating_OthW | inf |
| 72 | Heating_Wall | inf |
| 85 | GarageType_none | inf |
| 86 | GarageFinish_none | inf |
| 87 | GarageQual_Fa | inf |
| 88 | GarageQual_Gd | inf |
| 89 | GarageQual_Po | inf |
| 90 | GarageQual_TA | inf |
| 91 | GarageQual_none | inf |
| 39 | RoofStyle_Shed | inf |
| 92 | GarageCond_Fa | inf |
| 81 | GarageType_Attchd | 28.80 |
| 65 | BsmtExposure_none | 27.49 |
| 84 | GarageType_Detchd | 23.11 |
| 51 | Exterior2nd_CmentBd | 21.39 |
| 45 | Exterior1st_CemntBd | 20.86 |
| 55 | ExterQual_TA | 13.43 |
| 54 | ExterQual_Gd | 11.65 |
| 77 | KitchenQual_TA | 11.48 |
| 60 | BsmtQual_TA | 10.70 |
| 76 | KitchenQual_Gd | 9.16 |
| 59 | BsmtQual_Gd | 7.45 |
| 83 | GarageType_BuiltIn | 7.36 |
| 13 | Total_Age | 6.72 |
| 11 | Total_sqr_footage | 6.03 |
| 47 | Exterior1st_Wd Sdng | 5.46 |
| 53 | Exterior2nd_Wd Sdng | 5.38 |
| 24 | Neighborhood_Somerst | 5.00 |
| 14 | MSZoning_FV | 4.95 |
| 7 | TotRmsAbvGrd | 4.40 |
| 2 | OverallQual | 4.11 |
| 10 | MiscVal | 3.87 |
| 27 | Condition1_Norm | 3.79 |
| 62 | BsmtCond_Po | 3.48 |
| 74 | Electrical_Mix | 3.42 |
| 12 | Total_Bathrooms | 3.29 |
| 8 | GarageArea | 3.27 |
| 6 | KitchenAbvGr | 2.83 |
| 34 | BldgType_Duplex | 2.65 |
| 75 | KitchenQual_Fa | 2.64 |
| 5 | BedroomAbvGr | 2.64 |
| 26 | Condition1_Feedr | 2.63 |
| 58 | BsmtQual_Fa | 2.57 |
| 1 | LotArea | 2.43 |
| 15 | MSZoning_RL | 2.29 |
| 82 | GarageType_Basment | 2.27 |
| 18 | LandSlope_Sev | 2.17 |
| 4 | BsmtUnfSF | 2.16 |
| 41 | RoofMatl_Tar&Grv | 1.91 |
| 36 | BldgType_TwnhsE | 1.89 |
| 79 | Functional_Sev | 1.78 |
| 23 | Neighborhood_NridgHt | 1.77 |
| 30 | Condition1_RRAn | 1.65 |
| 20 | Neighborhood_MeadowV | 1.64 |
| 40 | RoofMatl_Metal | 1.62 |
| 3 | OverallCond | 1.62 |
| 99 | SaleType_New | 1.52 |
| 78 | Functional_Mod | 1.50 |
| 35 | BldgType_Twnhs | 1.50 |
| 28 | Condition1_PosN | 1.47 |
| 80 | Functional_Typ | 1.43 |
| 29 | Condition1_RRAe | 1.38 |
| 19 | Neighborhood_Crawfor | 1.37 |
| 64 | BsmtExposure_Gd | 1.37 |
| 56 | ExterCond_Po | 1.36 |
| 25 | Neighborhood_StoneBr | 1.30 |
| 16 | Street_Pave | 1.28 |
| 32 | Condition2_PosN | 1.22 |
| 22 | Neighborhood_NoRidge | 1.22 |
| 37 | HouseStyle_2.5Fin | 1.22 |
| 49 | Exterior2nd_BrkFace | 1.19 |
| 21 | Neighborhood_Mitchel | 1.19 |
| 52 | Exterior2nd_Other | 1.14 |
| 98 | SaleType_Con | 1.13 |
| 97 | SaleType_CWD | 1.13 |
| 100 | SaleCondition_Family | 1.13 |
| 46 | Exterior1st_Stone | 1.11 |
| 42 | RoofMatl_WdShngl | 1.09 |
| 57 | Foundation_Wood | 1.07 |
| 38 | RoofStyle_Mansard | 1.06 |
| 31 | Condition1_RRNe | 1.05 |
| 9 | PoolArea | 1.03 |
| 0 | const | 0.00 |
| 17 | Utilities_NoSeWa | NaN |
| 43 | Exterior1st_AsphShn | NaN |
| 48 | Exterior2nd_AsphShn | NaN |
| 73 | HeatingQC_Po | NaN |
# Many variables have high correlation
rfe = RFE(lr)
rfe.fit(X_train_rfe,y_train)
col = X_train_rfe.columns[rfe.support_]
X_train_rfe = X_train_rfe[col]
X_train_rfe = sm.add_constant(X_train_rfe)
lm = sm.OLS(y_train,X_train_rfe).fit()
print(lm.summary())
OLS Regression Results
==============================================================================
Dep. Variable: SalePrice R-squared: 0.915
Model: OLS Adj. R-squared: 0.911
Method: Least Squares F-statistic: 225.6
Date: Sun, 21 Jan 2024 Prob (F-statistic): 0.00
Time: 19:35:10 Log-Likelihood: 1789.2
No. Observations: 940 AIC: -3490.
Df Residuals: 896 BIC: -3277.
Df Model: 43
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 2.213e+09 1.44e+11 0.015 0.988 -2.8e+11 2.85e+11
OverallQual 0.1325 0.015 8.871 0.000 0.103 0.162
OverallCond 0.0796 0.010 7.607 0.000 0.059 0.100
BsmtUnfSF 0.0830 0.008 10.302 0.000 0.067 0.099
BedroomAbvGr -0.0809 0.013 -6.225 0.000 -0.106 -0.055
KitchenAbvGr -0.1384 0.019 -7.242 0.000 -0.176 -0.101
GarageArea 0.0683 0.012 5.741 0.000 0.045 0.092
MiscVal 0.1199 0.059 2.033 0.042 0.004 0.236
Total_sqr_footage 0.3060 0.013 22.687 0.000 0.279 0.332
Total_Bathrooms 0.0438 0.014 3.204 0.001 0.017 0.071
Total_Age -0.0977 0.011 -9.052 0.000 -0.119 -0.077
Street_Pave 0.0726 0.027 2.669 0.008 0.019 0.126
Neighborhood_Crawfor 0.0403 0.007 5.635 0.000 0.026 0.054
Neighborhood_NoRidge 0.0480 0.009 5.324 0.000 0.030 0.066
Neighborhood_NridgHt 0.0465 0.007 6.537 0.000 0.033 0.060
Neighborhood_StoneBr 0.0877 0.011 7.733 0.000 0.065 0.110
Condition2_PosN -0.0655 0.038 -1.727 0.084 -0.140 0.009
Condition2_RRAe -5.202e+06 3.39e+08 -0.015 0.988 -6.7e+08 6.59e+08
BldgType_Twnhs -0.0501 0.008 -6.350 0.000 -0.066 -0.035
BldgType_TwnhsE -0.0391 0.005 -7.405 0.000 -0.050 -0.029
HouseStyle_2.5Fin -0.0628 0.023 -2.699 0.007 -0.108 -0.017
RoofStyle_Shed 5.202e+06 3.39e+08 0.015 0.988 -6.59e+08 6.7e+08
Exterior1st_CBlock 7.813e+06 5.08e+08 0.015 0.988 -9.9e+08 1.01e+09
Exterior2nd_CBlock -7.813e+06 5.08e+08 -0.015 0.988 -1.01e+09 9.9e+08
Foundation_Wood -0.0640 0.038 -1.701 0.089 -0.138 0.010
BsmtQual_Fa -0.0580 0.011 -5.242 0.000 -0.080 -0.036
BsmtQual_Gd -0.0559 0.006 -8.903 0.000 -0.068 -0.044
BsmtQual_TA -0.0596 0.007 -7.963 0.000 -0.074 -0.045
BsmtQual_none -8.876e+06 5.78e+08 -0.015 0.988 -1.14e+09 1.12e+09
BsmtCond_none 5.548e+06 3.61e+08 0.015 0.988 -7.03e+08 7.14e+08
BsmtFinType1_none -2.289e+06 1.49e+08 -0.015 0.988 -2.95e+08 2.9e+08
BsmtFinType2_none 5.617e+06 3.66e+08 0.015 0.988 -7.12e+08 7.23e+08
Heating_GasA -2.213e+09 1.44e+11 -0.015 0.988 -2.85e+11 2.8e+11
Heating_GasW -2.213e+09 1.44e+11 -0.015 0.988 -2.85e+11 2.8e+11
Heating_Grav -2.213e+09 1.44e+11 -0.015 0.988 -2.85e+11 2.8e+11
Heating_OthW -2.213e+09 1.44e+11 -0.015 0.988 -2.85e+11 2.8e+11
Heating_Wall -2.213e+09 1.44e+11 -0.015 0.988 -2.85e+11 2.8e+11
KitchenQual_Fa -0.0562 0.011 -5.179 0.000 -0.078 -0.035
KitchenQual_Gd -0.0502 0.007 -7.494 0.000 -0.063 -0.037
KitchenQual_TA -0.0586 0.007 -8.032 0.000 -0.073 -0.044
Functional_Sev -0.1762 0.037 -4.712 0.000 -0.250 -0.103
GarageQual_Fa -8.906e+08 5.8e+10 -0.015 0.988 -1.15e+11 1.13e+11
GarageQual_Gd -8.906e+08 5.8e+10 -0.015 0.988 -1.15e+11 1.13e+11
GarageQual_Po -8.906e+08 5.8e+10 -0.015 0.988 -1.15e+11 1.13e+11
GarageQual_TA -8.906e+08 5.8e+10 -0.015 0.988 -1.15e+11 1.13e+11
GarageCond_Fa 8.906e+08 5.8e+10 0.015 0.988 -1.13e+11 1.15e+11
GarageCond_Gd 8.906e+08 5.8e+10 0.015 0.988 -1.13e+11 1.15e+11
GarageCond_Po 8.906e+08 5.8e+10 0.015 0.988 -1.13e+11 1.15e+11
GarageCond_TA 8.906e+08 5.8e+10 0.015 0.988 -1.13e+11 1.15e+11
SaleType_Con 0.1669 0.038 4.438 0.000 0.093 0.241
SaleType_New 0.0353 0.005 6.614 0.000 0.025 0.046
==============================================================================
Omnibus: 248.804 Durbin-Watson: 2.046
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2677.623
Skew: 0.884 Prob(JB): 0.00
Kurtosis: 11.077 Cond. No. 1.09e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.6e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
#Listing VIF of all columns
vif = pd.DataFrame()
vif['Features'] = X_train_rfe.columns #Column Names
vif['VIF'] = [variance_inflation_factor(X_train_rfe.values, i) for i in range(X_train_rfe.shape[1])] #VIF value
vif['VIF'] = round(vif['VIF'], 2) #Rounding to 2 decimal places
vif = vif.sort_values(by = "VIF", ascending = False) #arranging in decending order
vif
| Features | VIF | |
|---|---|---|
| 45 | GarageCond_Fa | inf |
| 22 | Exterior1st_CBlock | inf |
| 17 | Condition2_RRAe | inf |
| 44 | GarageQual_TA | inf |
| 21 | RoofStyle_Shed | inf |
| 32 | Heating_GasA | inf |
| 46 | GarageCond_Gd | inf |
| 47 | GarageCond_Po | inf |
| 48 | GarageCond_TA | inf |
| 23 | Exterior2nd_CBlock | inf |
| 42 | GarageQual_Gd | inf |
| 41 | GarageQual_Fa | inf |
| 33 | Heating_GasW | inf |
| 34 | Heating_Grav | inf |
| 35 | Heating_OthW | inf |
| 36 | Heating_Wall | inf |
| 43 | GarageQual_Po | inf |
| 28 | BsmtQual_none | 6145824.33 |
| 29 | BsmtCond_none | 6145824.33 |
| 30 | BsmtFinType1_none | 6145824.33 |
| 31 | BsmtFinType2_none | 6145824.33 |
| 27 | BsmtQual_TA | 9.53 |
| 39 | KitchenQual_TA | 9.16 |
| 38 | KitchenQual_Gd | 7.44 |
| 26 | BsmtQual_Gd | 6.66 |
| 10 | Total_Age | 4.38 |
| 8 | Total_sqr_footage | 3.97 |
| 7 | MiscVal | 3.58 |
| 1 | OverallQual | 3.44 |
| 9 | Total_Bathrooms | 3.15 |
| 6 | GarageArea | 2.70 |
| 25 | BsmtQual_Fa | 2.35 |
| 37 | KitchenQual_Fa | 2.26 |
| 4 | BedroomAbvGr | 1.86 |
| 3 | BsmtUnfSF | 1.79 |
| 14 | Neighborhood_NridgHt | 1.52 |
| 2 | OverallCond | 1.47 |
| 5 | KitchenAbvGr | 1.39 |
| 19 | BldgType_TwnhsE | 1.38 |
| 50 | SaleType_New | 1.37 |
| 20 | HouseStyle_2.5Fin | 1.19 |
| 12 | Neighborhood_Crawfor | 1.19 |
| 13 | Neighborhood_NoRidge | 1.17 |
| 15 | Neighborhood_StoneBr | 1.12 |
| 18 | BldgType_Twnhs | 1.11 |
| 11 | Street_Pave | 1.08 |
| 16 | Condition2_PosN | 1.05 |
| 24 | Foundation_Wood | 1.04 |
| 49 | SaleType_Con | 1.03 |
| 40 | Functional_Sev | 1.02 |
| 0 | const | 0.00 |
rfe=rfe.fit(X_train_rfe,y_train)
print(X_train_rfe.shape)
print(y_train.shape)
(940, 51) (940,)
y_train_pred = rfe.predict(X_train_rfe)
## Train score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
#Returns the mean squared error; we'll take a square root
np.sqrt(mean_squared_error(y_train, y_train_pred))
0.040582976524661515
On average, the squared difference between the actual and predicted values is On average, the squared difference between the actual and predicted values is 5.22.
r_squared = r2_score(y_train, y_train_pred)
r_squared
0.892961031888519
# Create a list of columns to drop from X_test
columns_to_drop = [col for col in X_test.columns if col not in X_train_rfe.columns]
print(len(columns_to_drop))
print(X_test.shape)
174 (403, 224)
## Test score
X_test_rfe = X_test.drop(columns=columns_to_drop)
X_test_rfe.shape
(403, 50)
rfe=rfe.fit(X_test_rfe,y_test)
y_test_pred = rfe.predict(X_test_rfe)
print(np.sqrt(mean_squared_error(y_test, y_test_pred)))
print(r2_score(y_test, y_test_pred))
0.03943445902844407 0.9024503770845567
Model 2: Ridge and Lasso¶
## lets perform Rige and Lasso on the columns given by rfe
Ridge - Regularization¶
print(X_train_rfe.shape)
print(y_train.shape)
X_train = X_train_rfe
(940, 51) (940,)
# Considering following alphas
params = {'alpha': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
9.0, 10.0, 20, 50, 100, 500, 1000 ]}
ridge = Ridge()
# cross validation
folds = 5
ridge_model_cv = GridSearchCV(estimator = ridge,
param_grid = params,
scoring= 'neg_mean_absolute_error',
cv = folds,
return_train_score=True,
verbose = 1)
ridge_model_cv.fit(X_train, y_train)
Fitting 5 folds for each of 27 candidates, totalling 135 fits
GridSearchCV(cv=5, estimator=Ridge(),
param_grid={'alpha': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5,
0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0,
6.0, 7.0, 8.0, 9.0, 10.0, 20, 50, 100, 500,
1000]},
return_train_score=True, scoring='neg_mean_absolute_error',
verbose=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=5, estimator=Ridge(),
param_grid={'alpha': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5,
0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0,
6.0, 7.0, 8.0, 9.0, 10.0, 20, 50, 100, 500,
1000]},
return_train_score=True, scoring='neg_mean_absolute_error',
verbose=1)Ridge()
Ridge()
ridge_cv_results = pd.DataFrame(ridge_model_cv.cv_results_)
ridge_cv_results = ridge_cv_results[ridge_cv_results['param_alpha']<=500]
ridge_cv_results[['param_alpha', 'mean_train_score', 'mean_test_score', 'rank_test_score']].sort_values(by = ['rank_test_score'])
| param_alpha | mean_train_score | mean_test_score | rank_test_score | |
|---|---|---|---|---|
| 12 | 1.0 | -0.025526 | -0.027323 | 1 |
| 11 | 0.9 | -0.025489 | -0.027327 | 2 |
| 10 | 0.8 | -0.025451 | -0.027334 | 3 |
| 9 | 0.7 | -0.025413 | -0.027346 | 4 |
| 8 | 0.6 | -0.025376 | -0.027366 | 5 |
| 7 | 0.5 | -0.025339 | -0.027397 | 6 |
| 13 | 2.0 | -0.025916 | -0.027405 | 7 |
| 6 | 0.4 | -0.025302 | -0.027437 | 8 |
| 5 | 0.3 | -0.025264 | -0.027487 | 9 |
| 4 | 0.2 | -0.025223 | -0.027550 | 10 |
| 14 | 3.0 | -0.026298 | -0.027628 | 11 |
| 3 | 0.1 | -0.025181 | -0.027636 | 12 |
| 2 | 0.01 | -0.025145 | -0.027752 | 13 |
| 1 | 0.001 | -0.025144 | -0.027767 | 14 |
| 0 | 0.0001 | -0.025144 | -0.027768 | 15 |
| 15 | 4.0 | -0.026668 | -0.027879 | 16 |
| 16 | 5.0 | -0.027050 | -0.028167 | 17 |
| 17 | 6.0 | -0.027436 | -0.028490 | 18 |
| 18 | 7.0 | -0.027829 | -0.028839 | 19 |
| 19 | 8.0 | -0.028237 | -0.029198 | 20 |
| 20 | 9.0 | -0.028652 | -0.029593 | 21 |
| 21 | 10.0 | -0.029070 | -0.029998 | 22 |
| 22 | 20 | -0.032929 | -0.033775 | 23 |
| 23 | 50 | -0.041236 | -0.041909 | 24 |
| 24 | 100 | -0.049096 | -0.049710 | 25 |
| 25 | 500 | -0.067564 | -0.067853 | 26 |
# plotting Negative Mean Absolute Error vs alpha for train and test
ridge_cv_results['param_alpha'] = ridge_cv_results['param_alpha'].astype('int32')
plt.figure(figsize=(8,5))
plt.plot(ridge_cv_results['param_alpha'], ridge_cv_results['mean_train_score'])
plt.plot(ridge_cv_results['param_alpha'], ridge_cv_results['mean_test_score'])
plt.xlabel('alpha parameter')
plt.ylabel('Negative Mean Absolute Error (Ridge regression)')
plt.title("Negative Mean Absolute Error and alpha\n",fontsize=15)
plt.legend(['train score', 'test score'], loc='upper right')
plt.show()
ridge_model_cv.best_params_
{'alpha': 1.0}
## Train score
# Hyperparameter lambda = 1.0
alpha = 1.0
ridge = Ridge(alpha=alpha)
ridge.fit(X_train, y_train)
#ridge.coef_
#Lets calculate the mean squared error value
mse = mean_squared_error(y_train, ridge.predict(X_train))
print("The mean squared error value is ",mse)
# predicting the R2 value of train data
y_train_pred = ridge.predict(X_train)
r2_train = metrics.r2_score(y_true=y_train, y_pred=y_train_pred)
print("The r2 value of train data is ",r2_train)
The mean squared error value is 0.0013402777692339 The r2 value of train data is 0.9128938268574391
## Test score
# Create a list of columns to drop from X_test
columns_to_drop = [col for col in X_test.columns if col not in X_train.columns]
print(len(columns_to_drop))
print(X_test.shape)
X_test_rfe = X_test.drop(columns=columns_to_drop)
ridge.fit(X_test_rfe,y_test)
y_test_pred = ridge.predict(X_test_rfe)
print(np.sqrt(mean_squared_error(y_test, y_test_pred)))
print(r2_score(y_test, y_test_pred))
174 (403, 224) 0.039247725758071256 0.9033720396193562
model_parameter = list(ridge.coef_)
model_parameter.insert(0,ridge.intercept_)
cols = X_test.columns
cols.insert(0,'constant')
ridge_coef = pd.DataFrame(list(zip(cols,model_parameter)))
ridge_coef.columns = ['Feaure','Coef']
ridge_coef.sort_values(by='Coef',ascending=False).head(10)
| Feaure | Coef | |
|---|---|---|
| 8 | LowQualFinSF | 0.237707 |
| 1 | LotFrontage | 0.181180 |
| 12 | Fireplaces | 0.093996 |
| 0 | MSSubClass | 0.090422 |
| 2 | LotArea | 0.079129 |
| 6 | BsmtUnfSF | 0.062856 |
| 15 | PoolArea | 0.056957 |
| 13 | GarageCars | 0.054825 |
| 9 | BedroomAbvGr | 0.051661 |
| 14 | GarageArea | 0.046775 |
9.3 Lasso - Regularization¶
lasso = Lasso()
# Considering following alphas
params = {'alpha': [0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.001, 0.002, 0.003, 0.004, 0.005, 0.01]}
# cross validation
folds = 5
lasso_model_cv = GridSearchCV(estimator = lasso,
param_grid = params,
scoring= 'neg_mean_absolute_error',
cv = folds,
return_train_score=True,
verbose = 1)
lasso_model_cv.fit(X_train, y_train)
Fitting 5 folds for each of 11 candidates, totalling 55 fits
GridSearchCV(cv=5, estimator=Lasso(),
param_grid={'alpha': [0.0001, 0.0002, 0.0003, 0.0004, 0.0005,
0.001, 0.002, 0.003, 0.004, 0.005, 0.01]},
return_train_score=True, scoring='neg_mean_absolute_error',
verbose=1)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=5, estimator=Lasso(),
param_grid={'alpha': [0.0001, 0.0002, 0.0003, 0.0004, 0.0005,
0.001, 0.002, 0.003, 0.004, 0.005, 0.01]},
return_train_score=True, scoring='neg_mean_absolute_error',
verbose=1)Lasso()
Lasso()
lasso_cv_results = pd.DataFrame(lasso_model_cv.cv_results_)
lasso_cv_results = lasso_cv_results[lasso_cv_results['param_alpha']<=500]
lasso_cv_results[['param_alpha', 'mean_train_score', 'mean_test_score', 'rank_test_score']].sort_values(by = ['rank_test_score'])
| param_alpha | mean_train_score | mean_test_score | rank_test_score | |
|---|---|---|---|---|
| 0 | 0.0001 | -0.025815 | -0.027356 | 1 |
| 1 | 0.0002 | -0.026394 | -0.027602 | 2 |
| 2 | 0.0003 | -0.026897 | -0.028060 | 3 |
| 3 | 0.0004 | -0.027545 | -0.028690 | 4 |
| 4 | 0.0005 | -0.028305 | -0.029426 | 5 |
| 5 | 0.001 | -0.031726 | -0.032461 | 6 |
| 6 | 0.002 | -0.034447 | -0.034865 | 7 |
| 7 | 0.003 | -0.036620 | -0.036985 | 8 |
| 8 | 0.004 | -0.039353 | -0.039719 | 9 |
| 9 | 0.005 | -0.042696 | -0.043141 | 10 |
| 10 | 0.01 | -0.060830 | -0.061227 | 11 |
# plotting Negative Mean Absolute Error vs alpha for train and test
lasso_cv_results['param_alpha'] = lasso_cv_results['param_alpha'].astype('float64')
plt.figure(figsize=(10,8))
plt.plot(lasso_cv_results['param_alpha'], lasso_cv_results['mean_train_score'])
plt.plot(lasso_cv_results['param_alpha'], lasso_cv_results['mean_test_score'])
plt.xlabel('alpha parameter')
plt.ylabel('Negative Mean Absolute Error (Lasso regression)')
plt.title("Negative Mean Absolute Error and alpha (Lasso regression)")
plt.legend(['train score', 'test score'], loc='upper right')
plt.show()
# lambda best estimator
lasso_model_cv.best_estimator_
Lasso(alpha=0.0001)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Lasso(alpha=0.0001)
# # Hyperparameter lambda = 0001
alpha = 0.0001
lasso = Lasso(alpha=alpha)
lasso.fit(X_train, y_train)
lasso.coef_
array([ 0. , 0.14936811, 0.07683105, 0.07566492, -0.07053207,
-0.11401635, 0.05506468, -0. , 0.29853129, 0.04153133,
-0.10228039, 0.01518065, 0.03516087, 0.04418029, 0.04738503,
0.08101888, -0. , -0. , -0.04706271, -0.03554114,
-0.02340574, -0. , -0. , -0. , -0. ,
-0.04301898, -0.04935221, -0.05051161, -0.02254578, -0. ,
-0. , -0. , 0.00277795, -0.00666772, -0. ,
-0. , 0. , -0.04908881, -0.04489768, -0.05603447,
-0.07706967, -0.00870119, 0.00747869, -0. , -0.00106238,
-0. , -0. , -0. , -0. , 0.06879055,
0.03647498])
## X_train score
#Lets calculate the mean squared error value
mse = mean_squared_error(y_train, lasso.predict(X_train))
print("The mean squared error value is ",mse)
# predicting the R2 value of train data
y_train_pred = lasso.predict(X_train)
r2_train = metrics.r2_score(y_true=y_train, y_pred=y_train_pred)
print("The r2 value of train data is ",r2_train)
The mean squared error value is 0.0013669680543668022 The r2 value of train data is 0.9111591949390572
## test score
# Create a list of columns to drop from X_test
columns_to_drop = [col for col in X_test.columns if col not in X_train.columns]
print(len(columns_to_drop))
print(X_test.shape)
X_test_rfe = X_test.drop(columns=columns_to_drop)
lasso.fit(X_test_rfe,y_test)
y_test_pred = lasso.predict(X_test_rfe)
print(np.sqrt(mean_squared_error(y_test, y_test_pred)))
print(r2_score(y_test, y_test_pred))
174 (403, 224) 0.03895489153666429 0.9048085770170703
model_param = list(lasso.coef_)
model_param.insert(0,lasso.intercept_)
cols = df_train.columns
cols.insert(0,'const')
lasso_coef = pd.DataFrame(list(zip(cols,model_param)))
lasso_coef.columns = ['Featuere','Coef']
lasso_coef.sort_values(by='Coef',ascending=False).head(10)
| Featuere | Coef | |
|---|---|---|
| 8 | LowQualFinSF | 0.281284 |
| 1 | LotFrontage | 0.211622 |
| 12 | Fireplaces | 0.090464 |
| 2 | LotArea | 0.089172 |
| 0 | MSSubClass | 0.075230 |
| 6 | BsmtUnfSF | 0.053667 |
| 15 | PoolArea | 0.051527 |
| 3 | OverallQual | 0.045568 |
| 13 | GarageCars | 0.044817 |
| 14 | GarageArea | 0.043813 |
Model 3: Lets perform ridge and Lasso without rfe¶
#Here we will keep the train size as 70% and automatically test size will be rest 30%
#Also we will keep random_state as a fixed value of 100 so that the data set does no changes
df_train,df_test = train_test_split(df_house, train_size = 0.7, random_state = 100)
print ("The Size of Train data is",df_train.shape)
print ("The Size of Test data is",df_test.shape)
Scaler = MinMaxScaler() # Instantiate an objectr
#Note-The above order of columns in num_cols should be same in df_test, otherwise we will get a wrong r-square value
df_train[Num_cols] = Scaler.fit_transform(df_train[Num_cols])
df_test[Num_cols] = Scaler.transform(df_test[Num_cols])
#Define X_train and y_train
y_train = df_train.pop('SalePrice') #This contains only the Target Variable
X_train = df_train #This contains all the Independent Variables except the Target Variable
#Since 'SalePrice' is the target variable we will keep it only on y-train and remove it from X_train
y_test = df_test.pop('SalePrice') #This contains only the Target Variable
X_test = df_test #This contains all the Independent Variables except the Target Variable
The Size of Train data is (940, 225) The Size of Test data is (403, 225)
# Considering following alphas
params = {'alpha': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
9.0, 10.0, 20, 50, 100, 500, 1000 ]}
ridge = Ridge()
# cross validation
folds = 5
ridge_model_cv = GridSearchCV(estimator = ridge,
param_grid = params,
scoring= 'neg_mean_absolute_error',
cv = folds,
return_train_score=True,
verbose = 1)
ridge_model_cv.fit(X_train, y_train)
ridge_cv_results = pd.DataFrame(ridge_model_cv.cv_results_)
ridge_cv_results = ridge_cv_results[ridge_cv_results['param_alpha']<=500]
ridge_cv_results[['param_alpha', 'mean_train_score', 'mean_test_score', 'rank_test_score']].sort_values(by = ['rank_test_score'])
Fitting 5 folds for each of 27 candidates, totalling 135 fits
| param_alpha | mean_train_score | mean_test_score | rank_test_score | |
|---|---|---|---|---|
| 14 | 3.0 | -0.020806 | -0.025553 | 1 |
| 13 | 2.0 | -0.020430 | -0.025573 | 2 |
| 15 | 4.0 | -0.021161 | -0.025599 | 3 |
| 16 | 5.0 | -0.021491 | -0.025705 | 4 |
| 17 | 6.0 | -0.021807 | -0.025842 | 5 |
| 12 | 1.0 | -0.020001 | -0.025907 | 6 |
| 11 | 0.9 | -0.019951 | -0.025985 | 7 |
| 18 | 7.0 | -0.022118 | -0.025992 | 8 |
| 10 | 0.8 | -0.019897 | -0.026080 | 9 |
| 19 | 8.0 | -0.022414 | -0.026167 | 10 |
| 9 | 0.7 | -0.019840 | -0.026206 | 11 |
| 8 | 0.6 | -0.019779 | -0.026355 | 12 |
| 20 | 9.0 | -0.022699 | -0.026366 | 13 |
| 7 | 0.5 | -0.019715 | -0.026531 | 14 |
| 21 | 10.0 | -0.022971 | -0.026571 | 15 |
| 6 | 0.4 | -0.019642 | -0.026746 | 16 |
| 5 | 0.3 | -0.019561 | -0.027049 | 17 |
| 4 | 0.2 | -0.019477 | -0.027454 | 18 |
| 3 | 0.1 | -0.019372 | -0.028063 | 19 |
| 22 | 20 | -0.025374 | -0.028461 | 20 |
| 2 | 0.01 | -0.019243 | -0.029159 | 21 |
| 1 | 0.001 | -0.019231 | -0.029368 | 22 |
| 0 | 0.0001 | -0.019230 | -0.029392 | 23 |
| 23 | 50 | -0.030367 | -0.032800 | 24 |
| 24 | 100 | -0.035664 | -0.037605 | 25 |
| 25 | 500 | -0.050757 | -0.051781 | 26 |
# plotting Negative Mean Absolute Error vs alpha for train and test
ridge_cv_results['param_alpha'] = ridge_cv_results['param_alpha'].astype('int32')
plt.figure(figsize=(8,5))
plt.plot(ridge_cv_results['param_alpha'], ridge_cv_results['mean_train_score'])
plt.plot(ridge_cv_results['param_alpha'], ridge_cv_results['mean_test_score'])
plt.xlabel('alpha parameter')
plt.ylabel('Negative Mean Absolute Error (Ridge regression)')
plt.title("Negative Mean Absolute Error and alpha\n",fontsize=15)
plt.legend(['train score', 'test score'], loc='upper right')
plt.show()
# lambda best estimator
ridge_model_cv.best_estimator_
Ridge(alpha=3.0)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Ridge(alpha=3.0)
# Hyperparameter lambda = 1.0
alpha = 3.0
ridge = Ridge(alpha=alpha)
ridge.fit(X_train, y_train)
#ridge.coef_
#Lets calculate the mean squared error value
mse = mean_squared_error(y_train, ridge.predict(X_train))
print("The mean squared error value is ",mse)
# predicting the R2 value of train data
y_train_pred = ridge.predict(X_train)
r2_train = metrics.r2_score(y_true=y_train, y_pred=y_train_pred)
print("The r2 value of train data is ",r2_train)
The mean squared error value is 0.0009472129960129189 The r2 value of train data is 0.9384395525110093
ridge.fit(X_test,y_test)
y_test_pred = ridge.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, y_test_pred)))
print(r2_score(y_test, y_test_pred))
0.03115501423756808 0.9391122777021567
ridge.fit(X_test,y_test)
y_test_pred = ridge.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, y_test_pred)))
print(r2_score(y_test, y_test_pred))
0.03115501423756808 0.9391122777021567
lasso = Lasso()
# Considering following alphas
params = {'alpha': [0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.001, 0.002, 0.003, 0.004, 0.005, 0.01]}
# cross validation
folds = 5
lasso_model_cv = GridSearchCV(estimator = lasso,
param_grid = params,
scoring= 'neg_mean_absolute_error',
cv = folds,
return_train_score=True,
verbose = 1)
lasso_model_cv.fit(X_train, y_train)
lasso_cv_results = pd.DataFrame(lasso_model_cv.cv_results_)
lasso_cv_results = lasso_cv_results[lasso_cv_results['param_alpha']<=500]
# plotting Negative Mean Absolute Error vs alpha for train and test
lasso_cv_results['param_alpha'] = lasso_cv_results['param_alpha'].astype('float64')
plt.figure(figsize=(10,8))
plt.plot(lasso_cv_results['param_alpha'], lasso_cv_results['mean_train_score'])
plt.plot(lasso_cv_results['param_alpha'], lasso_cv_results['mean_test_score'])
plt.xlabel('alpha parameter')
plt.ylabel('Negative Mean Absolute Error (Lasso regression)')
plt.title("Negative Mean Absolute Error and alpha (Lasso regression)")
plt.legend(['train score', 'test score'], loc='upper right')
plt.show()
Fitting 5 folds for each of 11 candidates, totalling 55 fits
# lambda best estimator
lasso_model_cv.best_estimator_
Lasso(alpha=0.0001)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Lasso(alpha=0.0001)
# # Hyperparameter lambda = 0001
alpha = 0.0001
lasso = Lasso(alpha=alpha)
lasso.fit(X_train, y_train)
lasso.coef_
array([-3.47046277e-02, 8.01568342e-03, 3.42401618e-02, 1.23318454e-01,
6.82283981e-02, 0.00000000e+00, 3.53450845e-02, 5.59990563e-02,
-0.00000000e+00, -2.58846205e-02, -0.00000000e+00, 2.32952266e-02,
1.53778127e-02, 6.08378698e-03, 4.33094507e-02, 0.00000000e+00,
-0.00000000e+00, 0.00000000e+00, -9.54102984e-04, 2.24896319e-01,
3.38439466e-02, 1.69408311e-02, -6.42945323e-02, -0.00000000e+00,
-1.13202566e-02, 0.00000000e+00, 0.00000000e+00, 1.50910996e-02,
0.00000000e+00, 2.23823656e-02, 1.29270450e-03, -0.00000000e+00,
2.31426375e-03, 0.00000000e+00, -1.73552569e-02, -4.21352878e-03,
0.00000000e+00, 1.13113534e-02, -6.94123098e-03, -0.00000000e+00,
0.00000000e+00, -0.00000000e+00, -0.00000000e+00, 0.00000000e+00,
9.26601542e-05, 1.23042610e-02, 0.00000000e+00, -2.59796427e-03,
2.11858172e-02, -1.09035173e-02, -1.14740138e-02, 0.00000000e+00,
-0.00000000e+00, -2.20161775e-02, -1.13770719e-02, 0.00000000e+00,
-7.45555850e-03, 4.01461357e-02, 4.22125412e-02, -9.74129425e-03,
-0.00000000e+00, -0.00000000e+00, 0.00000000e+00, 2.96928953e-02,
7.36635160e-02, -3.22058254e-03, 1.33742913e-02, 2.84363743e-03,
1.88417586e-02, -0.00000000e+00, 8.82123126e-03, -2.28903958e-02,
2.16786363e-03, 0.00000000e+00, -0.00000000e+00, 0.00000000e+00,
0.00000000e+00, -0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
0.00000000e+00, 0.00000000e+00, -2.71724364e-02, -5.31562868e-03,
-0.00000000e+00, 0.00000000e+00, -4.21334072e-03, -1.84993196e-02,
-0.00000000e+00, 9.45120728e-03, 0.00000000e+00, 0.00000000e+00,
0.00000000e+00, 0.00000000e+00, -1.97645200e-04, 0.00000000e+00,
-0.00000000e+00, 0.00000000e+00, 0.00000000e+00, -0.00000000e+00,
0.00000000e+00, 0.00000000e+00, 0.00000000e+00, -0.00000000e+00,
2.24215195e-02, -0.00000000e+00, 0.00000000e+00, -6.40076705e-03,
-0.00000000e+00, 2.01312931e-03, -5.27984329e-03, 0.00000000e+00,
3.64106922e-03, -0.00000000e+00, -1.34094567e-03, 0.00000000e+00,
0.00000000e+00, -0.00000000e+00, 0.00000000e+00, -0.00000000e+00,
1.57484001e-02, -0.00000000e+00, -0.00000000e+00, 0.00000000e+00,
-0.00000000e+00, -7.07074801e-03, 0.00000000e+00, 0.00000000e+00,
-0.00000000e+00, 0.00000000e+00, -7.85380618e-03, -0.00000000e+00,
-8.37582339e-03, -1.28827846e-02, -0.00000000e+00, -2.68891873e-03,
0.00000000e+00, 4.66932027e-03, 0.00000000e+00, 2.46027533e-03,
-0.00000000e+00, 0.00000000e+00, -0.00000000e+00, -2.58161396e-02,
-4.39301138e-02, -3.78446023e-02, -0.00000000e+00, 1.36415440e-03,
0.00000000e+00, 3.24563031e-03, -0.00000000e+00, 3.19780600e-02,
-3.62154800e-03, -3.97134955e-03, -0.00000000e+00, -0.00000000e+00,
7.45002184e-03, -4.90475851e-03, -3.48629827e-03, -2.77453492e-04,
-0.00000000e+00, 0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
-5.69212437e-03, 2.75652775e-03, -0.00000000e+00, 6.86128403e-03,
-0.00000000e+00, -0.00000000e+00, -0.00000000e+00, 0.00000000e+00,
0.00000000e+00, -2.73074823e-03, 0.00000000e+00, -4.82297695e-03,
-0.00000000e+00, 0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
-1.10429425e-03, -3.04929184e-02, -3.29164145e-02, -3.34439055e-02,
-0.00000000e+00, 0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
-3.89309875e-03, 2.36142689e-02, -0.00000000e+00, -0.00000000e+00,
1.62587974e-02, -0.00000000e+00, 4.09044669e-03, 0.00000000e+00,
-3.26021116e-03, -2.01877549e-03, 0.00000000e+00, -1.07349659e-02,
6.30237225e-04, -0.00000000e+00, -0.00000000e+00, 0.00000000e+00,
-5.82144061e-04, -0.00000000e+00, -0.00000000e+00, 0.00000000e+00,
0.00000000e+00, -1.87813737e-03, 1.23453328e-03, 0.00000000e+00,
0.00000000e+00, -0.00000000e+00, 0.00000000e+00, -0.00000000e+00,
5.52996840e-03, 0.00000000e+00, -5.38114145e-03, 0.00000000e+00,
-0.00000000e+00, -1.28284412e-02, 1.23104872e-02, 3.16934018e-02])
#Lets calculate the mean squared error value
mse = mean_squared_error(y_train, lasso.predict(X_train))
print("The mean squared error value is ",mse)
# predicting the R2 value of train data
y_train_pred = lasso.predict(X_train)
r2_train = metrics.r2_score(y_true=y_train, y_pred=y_train_pred)
print("The r2 value of train data is ",r2_train)
The mean squared error value is 0.0009707838136888879 The r2 value of train data is 0.9369076583225618
# Create a list of columns to drop from X_test
columns_to_drop = [col for col in X_test.columns if col not in X_train.columns]
print(len(columns_to_drop))
print(X_test.shape)
X_test_rfe = X_test.drop(columns=columns_to_drop)
lasso.fit(X_test_rfe,y_test)
y_test_pred = lasso.predict(X_test_rfe)
print(np.sqrt(mean_squared_error(y_test, y_test_pred)))
print(r2_score(y_test, y_test_pred))
0 (403, 224) 0.02966170849560128 0.9448092691620592
10. Conclusion :¶
The optimal value of LAMBDA we got in case of Ridge and Lasso is :
- Ridge - 3.0
- Lasso - 0.0001
Assignment Part - II ( Finding solutions for the subjective questions, Please refer the PDF file for the complete answers)¶
Question 1 : What will be the changes in the model if you choose double the value of alpha for both ridge and lasso?
#Lets find for Ridge first
alpha = 3.0 # Optimal value of alpha is 3
ridge = Ridge(alpha=alpha)
ridge.fit(X_train, y_train)
print("The output when alpha is 4: ")
mse = mean_squared_error(y_train, ridge.predict(X_train))
print("The mean squared error value of train data is ",mse)
mse = mean_squared_error(y_test, ridge.predict(X_test))
print("The mean squared error value of test data is ",mse)
r2_train = metrics.r2_score(y_true=y_train, y_pred=y_train_pred)
print("The r2 value of train data is ",r2_train)
r2_test = metrics.r2_score(y_true=y_test, y_pred=y_test_pred)
print("The r2 value of test data is ",r2_test)
print()
The output when alpha is 4: The mean squared error value of train data is 0.0009472129960129189 The mean squared error value of test data is 0.0015129109188580304 The r2 value of train data is 0.9369076583225618 The r2 value of test data is 0.9448092691620592
alpha = 6.0
ridge = Ridge(alpha=alpha)
ridge.fit(X_train, y_train)
ridge.coef_
print("The output when alpha is 6: ")
mse = mean_squared_error(y_test, ridge.predict(X_test))
print("The mean squared error value is ",mse)
r2_train = metrics.r2_score(y_true=y_train, y_pred=y_train_pred)
print("The r2 value of train data is ",r2_train)
r2_test = metrics.r2_score(y_true=y_test, y_pred=y_test_pred)
print("The r2 value of test data is ",r2_test)
The output when alpha is 6: The mean squared error value is 0.0015295259041306838 The r2 value of train data is 0.9369076583225618 The r2 value of test data is 0.9448092691620592
#Let's create a ridge model with alpha = 4.0
ridge_doubled = Ridge(alpha = 4.0)
ridge_doubled.fit(X_train,y_train)
y_train_ridge_pred_doubled = ridge_doubled.predict(X_train)
y_test_ridge_pred_doubled = ridge_doubled.predict(X_test)
ridge_coef_doubled_df = pd.DataFrame(ridge_doubled.coef_ , columns = ['Coefficient'], index = X_train.columns)
print("Top predictor features for ridge when alpha is 6 are :\n")
print(ridge_coef_doubled_df.sort_values(by = 'Coefficient', ascending = False).head(10))
Top predictor features for ridge when alpha is 6 are :
Coefficient
Total_sqr_footage 0.134351
OverallQual 0.088469
TotalBsmtSF 0.079235
Neighborhood_StoneBr 0.058877
TotRmsAbvGrd 0.046711
Total_Bathrooms 0.045971
OverallCond 0.044430
GarageArea 0.044175
LotArea 0.038149
Neighborhood_NoRidge 0.035671
#Now lets calculate for Lasso
alpha = 0.0001 #Optimal Value of alpha
lasso = Lasso(alpha=alpha)
lasso.fit(X_train, y_train)
lasso.coef_# mse
print("The output when alpha is 0.0001: ")
#Lets calculate the mean squared error value
mse = mean_squared_error(y_test, lasso.predict(X_test))
print("The mean squared error value is ",mse)
#predicting the R2 value on train data
y_train_pred = lasso.predict(X_train)
r2_train = metrics.r2_score(y_true=y_train, y_pred=y_train_pred)
print("The r2 value of train data is ",r2_train)
#predicting the R2 value on test data
y_test_pred = lasso.predict(X_test)
r2_test = metrics.r2_score(y_true=y_test, y_pred=y_test_pred)
print("The r2 value of test data is ",r2_test)
print()
alpha = 0.0002 #Optimal Value of alpha
lasso = Lasso(alpha=alpha)
lasso.fit(X_train, y_train)
lasso.coef_# mse
print("The output when alpha is 0.0002: ")
#Lets calculate the mean squared error value
mse = mean_squared_error(y_test, lasso.predict(X_test))
print("The mean squared error value is ",mse)
#predicting the R2 value on train data
y_train_pred = lasso.predict(X_train)
r2_train = metrics.r2_score(y_true=y_train, y_pred=y_train_pred)
print("The r2 value of train data is ",r2_train)
#predicting the R2 value on test data
y_test_pred = lasso.predict(X_test)
r2_test = metrics.r2_score(y_true=y_test, y_pred=y_test_pred)
print("The r2 value of test data is ",r2_test)
The output when alpha is 0.0001: The mean squared error value is 0.0014895373420533808 The r2 value of train data is 0.9369076583225618 The r2 value of test data is 0.9065616382631766 The output when alpha is 0.0002: The mean squared error value is 0.0014958564828966149 The r2 value of train data is 0.9313084900646047 The r2 value of test data is 0.9061652398975188
#Let's create a lasso model with alpha = 0.0002
lasso_doubled = Lasso(alpha=0.0002)
lasso_doubled.fit(X_train,y_train)
y_train_pred_doubled = lasso_doubled.predict(X_train)
y_test_pred_doubled = lasso_doubled.predict(X_test)
lasso_coef_doubled_df = pd.DataFrame(lasso_doubled.coef_ , columns = ['Coefficient'], index = X_train.columns)
print("Top correlated features of Lasso when alpha is 0.0002 are:\n")
print(lasso_coef_doubled_df.sort_values(by = 'Coefficient', ascending = False).head(10))
Top correlated features of Lasso when alpha is 0.0002 are:
Coefficient
Total_sqr_footage 0.225682
OverallQual 0.133786
Neighborhood_StoneBr 0.065950
TotalBsmtSF 0.060540
OverallCond 0.058713
Neighborhood_NridgHt 0.044229
GarageArea 0.043697
Neighborhood_NoRidge 0.035138
SaleCondition_Partial 0.032983
BsmtExposure_Gd 0.030938
Question 3 :After building the model, you realised that the five most important predictor variables in the lasso model are not available in the incoming data. You will now have to create another model excluding the five most important predictor variables. Which are the five most important predictor variables now?
X_train.info()
<class 'pandas.core.frame.DataFrame'> Index: 940 entries, 1146 to 858 Columns: 224 entries, MSSubClass to SaleCondition_Partial dtypes: float64(25), int64(199) memory usage: 1.6 MB
alpha = 0.0001 #Optimal Value of alpha
lasso = Lasso(alpha=alpha)
lasso.fit(X_train, y_train)
lasso_pred = pd.DataFrame(lasso_doubled.coef_ , columns = ['Coefficient'], index = X_train.columns)
print(lasso_pred.sort_values(by = 'Coefficient', ascending = False).head(10))
top_5_predictors = lasso_pred['Coefficient'].nlargest(5).index
X_train_dropped = X_train.drop(columns=top_5_predictors)
#print(X_train_dropped.columns)
print('r2 score',metrics.r2_score(y_true=y_train, y_pred= lasso.predict(X_train)))
Coefficient Total_sqr_footage 0.225682 OverallQual 0.133786 Neighborhood_StoneBr 0.065950 TotalBsmtSF 0.060540 OverallCond 0.058713 Neighborhood_NridgHt 0.044229 GarageArea 0.043697 Neighborhood_NoRidge 0.035138 SaleCondition_Partial 0.032983 BsmtExposure_Gd 0.030938 r2 score 0.9369076583225618
#Lasso prediction by dropping top5 predictors
lasso.fit(X_train_dropped, y_train)
lasso_pred = pd.DataFrame(lasso.coef_ , columns = ['Coefficient'], index = X_train_dropped.columns)
y_train_pred = lasso.predict(X_train_dropped)
print('Applying Lasso after dropping 5 predictor variables')
print(lasso_pred.sort_values(by = 'Coefficient', ascending = False).head(10))
print('r2 score',metrics.r2_score(y_true=y_train, y_pred=y_train_pred))
Applying Lasso after dropping 5 predictor variables
Coefficient
TotRmsAbvGrd 0.129455
Total_Bathrooms 0.112029
GarageArea 0.105183
Fireplaces 0.051469
LotArea 0.048502
Street_Pave 0.047245
Neighborhood_NoRidge 0.044297
BsmtExposure_Gd 0.042970
BsmtUnfSF 0.040704
SaleCondition_Partial 0.033489
r2 score 0.9050034157050509